>> Steps to Build a Tokenizer

   1. Normalization: This is where we clean up the text. It can involve removing spaces, accents, or normalizing text to a standard Unicode format.
   2. Pre-tokenization: This step involves splitting the input text into individual words or smaller chunks.
   3. Tokenization: Here, the pre-tokenized words are converted into a sequence of tokens that the model can understand.
   4. Post-processing: Special tokens are added, and attention masks or token type IDs are generated to help the model process the input correctly.

>> Components of a Tokenizer

   The `🤗 Tokenizers` library provides different components that you can mix and match to build your tokenizer. These include:

   - Normalizers: These handle text normalization (e.g., removing accents).
   - Pre-tokenizers: These split text into words or subwords.
   - Models: These define how the tokenizer breaks down the text (e.g., using BPE, WordPiece).
   - Trainers: These are used to train the tokenizer on a specific dataset.
   - Post-processors: These add special tokens and adjust the output for the model.
   - Decoders: These reverse the tokenization process, turning tokens back into text.

>> Acquiring a Corpus

   To train the tokenizer, we need a text corpus. For example, let's use the `WikiText-2` dataset:

python code

'''
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
'''

This function `get_training_corpus()` provides batches of text to train the tokenizer.

>> Building a WordPiece Tokenizer

  To build a tokenizer using the `🤗 Tokenizers` library, start by creating a `Tokenizer` object:

python code

'''
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
'''

Here, `[UNK]` is a token for unknown words.

>> Normalization

Next, set up normalization. For example, using the BERT normalizer:

python code

'''
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
'''

Or you can create a custom normalizer:

python code

'''
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
'''

>> Pre-tokenization

Now, let's set up pre-tokenization:

python code

'''
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
'''

Or use a simpler pre-tokenizer:

python code

'''
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
'''

>> Training the Tokenizer

   To train the tokenizer, you need a trainer. For WordPiece, set it up like this:

python code

'''
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
'''

You can also train using text files:

python code

'''
tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
'''

>> Post-processing

   Add post-processing to handle special tokens like `[CLS]` and `[SEP]`:

python code

'''
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
'''

>> Decoding

   Finally, add a decoder to convert tokens back to text:

python code

'''
tokenizer.decoder = decoders.WordPiece(prefix="##")
'''

>> Saving and Loading the Tokenizer

   Save your tokenizer:

python code

'''
tokenizer.save("tokenizer.json")
'''

And reload it later:

python code

'''
new_tokenizer = Tokenizer.from_file("tokenizer.json")
'''

This process allows you to build a tokenizer from scratch, giving you control over each step.
